Illustration by Allison Horst
Taken from R for Data Science
We start with Transform and Visualize with the assumption that our data is in a nice, “tidy” state.
We will work with metadata, mutation, and expression data.frames.
data.frame?
Remember that a major theme of the course is: How we organize ideas <-> Instructing a computer to do something.
With tidy data, we can think about how to transform our data so that it answers our scientific question.
dplyr lets us do data wrangling
Illustration by Allison Horst
How is dplyr related to the tidyverse?
- tidyverse is a set of packages for working with data
- dplyr is one of them
- ggplot2 is another
- readr loads data

When you use library(tidyverse), that loads up the tidyverse packages.
library()?
You should only have to load packages once in your session. Using library(tidyverse) will load most of what you need.
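As a minimal sketch of what loading does (assuming the tidyverse package is installed), you can check which packages are attached after a single library() call:

```r
# Load the tidyverse once per session
library(tidyverse)

# search() lists the attached packages; dplyr, ggplot2 and readr
# should all appear after the call above
c("package:dplyr", "package:ggplot2", "package:readr") %in% search()
```

If any of these come back FALSE, the package is probably not installed or the library() call failed.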
dplyr functions

| Function Name | Purpose | When |
|---|---|---|
| `select()` | Selects sets of columns in df | This Week |
| `filter()` | Filters rows in df | This Week |
| `mutate()` | Calculates a new column in df | Next Week |
| `group_by()`/`summarize()` | Calculates summary statistics across groups | Next Week |
| `arrange()` | Sorts a df by one or more columns | Next Week |
| `_join()` | Functions to merge two tables together | Next Week |
| `\|>` | Operator to build pipelines | This Week |
In the dataframe you have here, which rows would you filter for and columns would you select that relate to a scientific question?
✅ Implicit: “I want to filter for rows such that the subtype is breast cancer and look at the Age and Sex.”
🚫 Explicit: “I want to filter for rows 20-50 and select columns 2 and 8”.
Notice that when we filter for rows in an implicit way, we often formulate criteria about the columns.
Here, filter() and select() are functions from the dplyr package, which is part of the tidyverse.
filter()
metadata_filtered = filter(metadata, OncotreeLineage == "Breast")
The second argument: is it a logical indexing vector built from a comparison operator?
But the variable OncotreeLineage does not exist in our environment!
Rather, OncotreeLineage is a column from metadata, and we are referring to it as a data variable. We can directly refer to the column vector metadata$OncotreeLineage with just OncotreeLineage.
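To make the idea of a data variable concrete, here is a small sketch. The metadata table below is a made-up stand-in with a handful of toy rows (the column names come from the slides, the values do not), so the output will not match the real course data.

```r
library(dplyr)

# Toy stand-in for the course's metadata table; all values are made up
metadata = data.frame(
  ModelID = c("M1", "M2", "M3", "M4"),
  OncotreeLineage = c("Breast", "Lung", "Breast", "Skin"),
  Age = c(45, 60, 52, 38),
  Sex = c("Female", "Male", "Female", "Male")
)

# dplyr: OncotreeLineage is looked up as a column of metadata (a data variable)
metadata_filtered = filter(metadata, OncotreeLineage == "Breast")

# Base R equivalent: the column has to be spelled out with metadata$
metadata_base = metadata[metadata$OncotreeLineage == "Breast", ]

# There is no standalone object with this name in the environment
exists("OncotreeLineage")
```

Both approaches keep the same rows; filter() just saves you from repeating metadata$ in every condition.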
Try filter() out
Try filter() for Sex == "Female".
select()
The input arguments for select() are also data variables.
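A minimal sketch of a select() call, again using a made-up toy version of metadata (only the column names are taken from the slides):

```r
library(dplyr)

# Toy stand-in for the course's metadata table; all values are made up
metadata = data.frame(
  ModelID = c("M1", "M2", "M3", "M4"),
  OncotreeLineage = c("Breast", "Lung", "Breast", "Skin"),
  Age = c(45, 60, 52, 38),
  Sex = c("Female", "Male", "Female", "Male")
)

# Column names are written bare, as data variables, with no quotes or metadata$
metadata_selected = select(metadata, ModelID, Age, Sex)

# Every row is kept; only the listed columns remain
names(metadata_selected)
```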
Try select() out
Add OncotreeLineage to the select() statement.
- select() works on columns
- filter() works on rows

tidyverse functions
Both filter() and select():
- take a data.frame as input
- return a data.frame as output

When combining multiple functions in one expression, it gets harder to read:
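For instance, a sketch that nests filter() inside select() on a made-up toy version of metadata (toy values, not the course data):

```r
library(dplyr)

# Toy stand-in for the course's metadata table; all values are made up
metadata = data.frame(
  ModelID = c("M1", "M2", "M3", "M4"),
  OncotreeLineage = c("Breast", "Lung", "Breast", "Skin"),
  Age = c(45, 60, 52, 38),
  Sex = c("Female", "Male", "Female", "Male")
)

# The inner call, filter(), runs first; its output becomes the
# first argument of the outer call, select()
metadata_selected = select(filter(metadata, OncotreeLineage == "Breast"), ModelID, Age, Sex)
```

You have to read this from the inside out, which is the opposite of the order in which the steps happen.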
Or, this: 🤨
result2 = function1(function2(function3(dataframe)))
Or… 🤕
result = function1(function2(function3(dataframe, df_col4, df_col2), arg2), df_col5, arg1)
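With pipes, the same two compositions read top to bottom, starting with the innermost function:
result2 = dataframe |>
  function3() |>
  function2() |>
  function1()
result = dataframe |>
  function3(df_col4, df_col2) |>
  function2(arg2) |>
  function1(df_col5, arg1)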
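To check that a nested call and a piped call really are equivalent, here is a small sketch using only base R functions (the vector x is an arbitrary example; the native pipe |> needs R 4.1 or later):

```r
x = c(-16, 9, -4)

# Nested: read from the inside out (abs first, then sqrt)
nested = sqrt(abs(x))

# Piped: read top to bottom (abs, AND THEN sqrt)
# Note the parentheses: the native pipe needs a function call on its right
piped = x |>
  abs() |>
  sqrt()

identical(nested, piped)  # TRUE
```

The pipe passes the value on its left as the first argument of the call on its right, so the innermost function of the nested form comes first in the pipeline.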
Rewrite the select() and filter() function composition example using the pipe metaphor and syntax.
When I see pipes, I read them as AND THEN:
graph TD A["metadata<br/>(data.frame)"]
Start with the data.frame metadata.
graph TD A["metadata<br/>(data.frame)"] --"filter(OncotreeLineage == 'Lung')"--> B["filtered_metadata<br/>(data frame)"]
Feed the data.frame into the first function:
metadata |>
  filter(OncotreeLineage == "Breast")
The output at this point is a data.frame.
graph TD A["metadata<br/>(data.frame)"] --"filter(OncotreeLineage == 'Lung')"--> B["filtered_metadata<br/>(data frame)"] B --"select(ModelID, Age, Sex)"--> C["selected_metadata<br/>(data frame)"]
The output is a data.frame, which means we can feed it into our next function:
metadata |>
  filter(OncotreeLineage == "Breast") |>
  select(ModelID, Age, Sex)
The final output is again a data.frame.
Look at the output at each step using head() before you move on!
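One way to do that check, sketched on a made-up toy version of metadata (the names step1 and step2 are just illustrative):

```r
library(dplyr)

# Toy stand-in for the course's metadata table; all values are made up
metadata = data.frame(
  ModelID = c("M1", "M2", "M3", "M4"),
  OncotreeLineage = c("Breast", "Lung", "Breast", "Skin"),
  Age = c(45, 60, 52, 38),
  Sex = c("Female", "Male", "Female", "Male")
)

# Step 1: filter only, then peek at the first rows
step1 = metadata |>
  filter(OncotreeLineage == "Breast") |>
  head()
step1

# Step 2: once step 1 looks right, add select() and peek again
step2 = metadata |>
  filter(OncotreeLineage == "Breast") |>
  select(ModelID, Age, Sex) |>
  head()
step2
```

head() returns at most the first six rows by default, so it stays readable on large tables. Remove it from the end of the pipeline once you are happy with each step.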
🤠
Build a pipeline that uses:
- filter(OncotreeLineage == "Lung")
- select(ModelID, OncotreeLineage, Age)

Try piping the output into head() as you build it up.
Next week:
- mutate()
- group_by()/summarize()
- _join() functions